Biostat 200C Homework 1

Due Apr 11 @ 11:59PM

Published

April 11, 2025

library(gtsummary)
Warning: package 'gtsummary' was built under R version 4.3.3
library(GGally)
library(knitr)
library(tidyverse)
library(faraway)
Warning: package 'faraway' was built under R version 4.3.3
library(gridExtra)
library(splines)
library(mlbench)
library(lmtest)
library(mice)
Warning: package 'mice' was built under R version 4.3.3
library(MASS)
library(ggplot2)
library(tidyr)
library(dplyr) 
library(corrplot)
library(ggplot2)
library(lmtest)

To submit homework, please submit Rmd and html files to bruinlearn by the deadline.

1 Q1. Reivew of linear models (60pts)

The swiss data — use Fertility as the response to practice.

1.1 Q1.1

An initial data analysis that explores the numerical and graphical characteristics of the data.(5pts)

Solution According to online sources, the Swiss Fertility and Socioeconomic Indicators dataset represent the standardized fertility measure and socioeconomic indicators for each of the 47 French-speaking provinces in Switzerland, dated in 1888.

The swiss data set is a data frame with 47 observations of 6 variables: Fertility (numeric), Agriculture (numeric), Examination (integer), Education (integer), Catholic (numeric), and Infant Mortality (numeric). The variables are represented as follows: 1. Fertility (commmon fertility measure in Ig), 2. Agriculture (percentage of males involved in agriculture as occupation), 3. Examination (percentages of draftees receiving highest mark on army examination), 4. Education (percentage eduation beyond primary school for draftees), 5. Catholic (percent of catholics as opposed to protestants), 6. Infant Mortality (live births who live less than 1 year).

data(swiss)
swiss <- swiss %>% 
  as_tibble(rownames = "Provinces") %>%
  print(width = Inf)
# A tibble: 47 × 7
   Provinces    Fertility Agriculture Examination Education Catholic
   <chr>            <dbl>       <dbl>       <int>     <int>    <dbl>
 1 Courtelary        80.2        17            15        12     9.96
 2 Delemont          83.1        45.1           6         9    84.8 
 3 Franches-Mnt      92.5        39.7           5         5    93.4 
 4 Moutier           85.8        36.5          12         7    33.8 
 5 Neuveville        76.9        43.5          17        15     5.16
 6 Porrentruy        76.1        35.3           9         7    90.6 
 7 Broye             83.8        70.2          16         7    92.8 
 8 Glane             92.4        67.8          14         8    97.2 
 9 Gruyere           82.4        53.3          12         7    97.7 
10 Sarine            82.9        45.2          16        13    91.4 
   Infant.Mortality
              <dbl>
 1             22.2
 2             22.2
 3             20.2
 4             20.3
 5             20.6
 6             26.6
 7             23.6
 8             24.9
 9             21  
10             24.4
# ℹ 37 more rows
str(swiss)
tibble [47 × 7] (S3: tbl_df/tbl/data.frame)
 $ Provinces       : chr [1:47] "Courtelary" "Delemont" "Franches-Mnt" "Moutier" ...
 $ Fertility       : num [1:47] 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num [1:47] 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int [1:47] 15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int [1:47] 12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num [1:47] 9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num [1:47] 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

The summary statistics (minimum, 1st quartile, median, mean 3rd quartile, and maximum) of the variables in the swiss dataset are as follows: We are able to obtain the numerical summaries of all these predictor variables becuase they are all quantitiative.

summary(swiss) 
  Provinces           Fertility      Agriculture     Examination   
 Length:47          Min.   :35.00   Min.   : 1.20   Min.   : 3.00  
 Class :character   1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00  
 Mode  :character   Median :70.40   Median :54.10   Median :16.00  
                    Mean   :70.14   Mean   :50.66   Mean   :16.49  
                    3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00  
                    Max.   :92.50   Max.   :89.70   Max.   :37.00  
   Education        Catholic       Infant.Mortality
 Min.   : 1.00   Min.   :  2.150   Min.   :10.80   
 1st Qu.: 6.00   1st Qu.:  5.195   1st Qu.:18.15   
 Median : 8.00   Median : 15.140   Median :20.00   
 Mean   :10.98   Mean   : 41.144   Mean   :19.94   
 3rd Qu.:12.00   3rd Qu.: 93.125   3rd Qu.:21.70   
 Max.   :53.00   Max.   :100.000   Max.   :26.60   

Data Visualization Plots: Univariate

  1. Univariate plot of the Fertility Variable (response variable): The distribution appears to be slightly right skewed (positive skewed) where most provinces have moderate fertility rates and a few relatively high nes. There is a single peak, making the distribution approximately unimodal where the most common fertility rates are between 70-75. There are some provinces that have unusually low or high fertility rates (around 35 or 95).
ggplot(swiss, aes(x = Fertility)) +
  geom_histogram(aes(y = ..density..), bins = 20, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(
    title = "Histogram of Fertility Rates in Swiss Provinces",
    x = "Fertility Rate",
    y = "Density"
  ) +
  theme_minimal()
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

  1. Boxplots for the Predictor Variables:

The boxplots show the distribution of the predictor variables (Agriculture, Catholic, Education, Examination, and Infant Mortality) in the swiss dataset. Brief summary of the boxplots: 1. Agriculture: The median percentage of males involved in agriculture as occupation lies between 50-60 percent, the distribution is moderately spread out, and there no clear outliers. 2. Catholic: The distribution for the percent of catholics as opposed to protestants is severely right skewed (positve skewed), the spread is extremely wide, and there are no visible outliers. This indicates that most of the provinces have a small percentage of Catholics. 3. Education: The percentage education beyond primary school for draftees is slightly right skewed (positive skewed), the median is low (between 0 and 10), there are a few notable outliers in the distribution (greater than 30) but the spread of the percentages is narrow. 4. Examination: The distribution for the percentage of draftees receiving highest marks on army examinations is slightly skewed to the right (positively skewed) with no strong outliers and the median is near the lower end of the IQR. 5. Infant Mortality: The distribution that represents the number of live births who live less than 1 year is fairly symmetrical with a slightly longer upper whisker, the range is narrow, ad the median is around 20.

predictor_vars <- c("Agriculture", "Examination", "Education", "Catholic", "Infant.Mortality")

swiss_long <- swiss %>%
  pivot_longer(cols = all_of(predictor_vars), names_to = "Variable", values_to = "Value")

ggplot(swiss_long, aes(x = "", y = Value)) +
  geom_boxplot(fill = "lightgreen") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Boxplots of Predictor Variables", x = "", y = "") +
  theme_minimal()

  1. Density plots for all variables including Fertility

For better visualization of the distributions of all the predictor variables and the response variable, Fertility, we plot the density plots. The distributions reflect the same characterisitics as the histograms above.

swiss_long <- swiss %>%
  pivot_longer(cols = -Provinces, names_to = "Variable", values_to = "Value")

ggplot(swiss_long, aes(x = Value)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free", ncol = 2) +
  labs(title = "Density Plots of Swiss Dataset Variables")

Data Visualization: Multivariable Relationships

We will create several scatter plots between 2 variables because all 6 variables are continuous. Description of the Scatter Plots: 1. Fertility versus Agricutlure: The trend below shows that the higher agriculture activity correlates with higher fertility rates (positive relationship). This aligns with the historical patterns where economies primarly based on agriculture often have larger familities due to labor needs and lower urbanization. 2. Fertility versus Education: The higher education levels often correlate with lower fertility rates. 3. Fertility versus Infant Mortality: The higher infant morality rates, the higher fertility (parents have more children to offset expected losses.) 4. Fertility versus Examination: Fertility may decline as the percentages of draftees receiving highest marks on army examination increases. 5. Fertility versus Catholic: Regions with higher Catholic populations may have higher fertility due to religious norms.

plot_list <- list()
predictors <- c("Agriculture", "Examination", "Education", "Catholic", "Infant.Mortality")
for(predictor in predictors) {
  p <- ggplot(swiss, aes_string(x = predictor, y = "Fertility")) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    ggtitle(paste("Fertility vs", predictor)) +
    theme_minimal()
  
  plot_list[[predictor]] <- p
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
grid.arrange(grobs = plot_list, ncol = 2) 

Data Visualization: Correlation Plots Description: The correlation matrix shows the pairwise comparisons of the predictor variables. We are able to see that educaiton and examination are highly positively correlated (0.70) indicating that higher education levels are assocaited with better examination outcomes. Education and Agriculture have a strong negative correlation (-0.64) suggesting that regions with higher agricultural activity tend to have lower education. The predictor variables Catholic and Fertility are negatively correlated (-0.60). Infant mortality showcases weaker correlations with the other variables.

cor_matrix <- cor(swiss[, -1]) 
corrplot(cor_matrix, 
         method = "number",     
         type = "upper",        
         order = "hclust",       
         tl.col = "black",     
         tl.srt = 45,           
         addCoef.col = "black") 

1.2 Q1.2

Variable selection to choose the best model. (10pts)

Solution We used the stepwise selection method to identify the best linear model for predicting Fertility based on the predictors: Agriculture, Examination, Education, Catholic, and Infant Mortality. The full model included all main effects as well as all possible two-way interactions (e.g., Agriculture:Examination). The step() function was applied to iteratively add and remove terms in order to minimize the Akaike Information Criterion (AIC), using both forward and backward selection.

The resulting best model achieved the lowest AIC score of 317.41 and included the following terms: Main effects: Agriculture, Examination, Education, Catholic, Infant Mortality Interaction terms: Agriculture:Examination, Agriculture:Education, Agriculture:Infant.Mortality, Examination:Education, Examination:Infant.Mortality, and Education:Catholic.

All variables in the model were statistically significant except for the interactions Agriculture:Examination and Agriculture:Education. The model has an R-squared value of 0.811, indicating that approximately 81% of the variance in the Fertility variable is explained by the model. Additionally, the overall model is highly significant, with a p-value of 1.283e-09.

full_model <- lm(Fertility ~ (Agriculture + Examination + Education + Catholic + Infant.Mortality)^2, data = swiss)
best_model <- step(full_model, direction = "both", trace = FALSE)
summary(best_model)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education + 
    Catholic + Infant.Mortality + Agriculture:Examination + Agriculture:Education + 
    Agriculture:Infant.Mortality + Examination:Education + Examination:Infant.Mortality + 
    Education:Catholic, data = swiss)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6080 -3.6647 -0.5637  2.9216 13.7363 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  225.910055  52.457566   4.307 0.000128 ***
Agriculture                   -1.906654   0.562824  -3.388 0.001756 ** 
Examination                   -5.120204   1.578305  -3.244 0.002593 ** 
Education                     -2.473498   1.202766  -2.057 0.047249 *  
Catholic                       0.211157   0.054180   3.897 0.000420 ***
Infant.Mortality              -5.269347   2.287270  -2.304 0.027292 *  
Agriculture:Examination        0.014880   0.009277   1.604 0.117706    
Agriculture:Education          0.019081   0.013718   1.391 0.173013    
Agriculture:Infant.Mortality   0.063530   0.024021   2.645 0.012158 *  
Examination:Education          0.063891   0.030098   2.123 0.040925 *  
Examination:Infant.Mortality   0.172194   0.064887   2.654 0.011891 *  
Education:Catholic            -0.012381   0.005237  -2.364 0.023741 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.225 on 35 degrees of freedom
Multiple R-squared:  0.811, Adjusted R-squared:  0.7517 
F-statistic: 13.66 on 11 and 35 DF,  p-value: 1.283e-09
cat("AIC of the best model:", AIC(best_model), "\n")
AIC of the best model: 317.4131 
cat("Predictors in the best model:\n")
Predictors in the best model:
print(attr(terms(best_model), "term.labels"))
 [1] "Agriculture"                  "Examination"                 
 [3] "Education"                    "Catholic"                    
 [5] "Infant.Mortality"             "Agriculture:Examination"     
 [7] "Agriculture:Education"        "Agriculture:Infant.Mortality"
 [9] "Examination:Education"        "Examination:Infant.Mortality"
[11] "Education:Catholic"          

The following code performs single-term deletion diagnostics on the best model obtained earlier. It evaluates how the model’s performance would change if each term were removed individually. The results show that two interaction terms — Agriculture:Examination and Agriculture:Education — have p-values greater than 0.05, indicating they are not statistically significant.

Additionally, the RSS (Residual Sum of Squares) values associated with these terms are lower compared to the others, meaning that removing these terms results in only a slight increase (or even improvement) in model error. This suggests that these predictors do not contribute meaningfully to improving model fit, and could potentially be removed to simplify the model without significantly comprising performance.

drop1(best_model, test = "F")
Single term deletions

Model:
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + Agriculture:Examination + Agriculture:Education + 
    Agriculture:Infant.Mortality + Examination:Education + Examination:Infant.Mortality + 
    Education:Catholic
                             Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>                                    1356.3 182.03                  
Agriculture:Examination       1    99.701 1456.0 183.37  2.5727 0.11771  
Agriculture:Education         1    74.979 1431.3 182.56  1.9348 0.17301  
Agriculture:Infant.Mortality  1   271.060 1627.4 188.60  6.9946 0.01216 *
Examination:Education         1   174.619 1531.0 185.72  4.5060 0.04092 *
Examination:Infant.Mortality  1   272.913 1629.3 188.65  7.0424 0.01189 *
Education:Catholic            1   216.628 1573.0 187.00  5.5900 0.02374 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We will further test to see if the non-signifcant interaction terms can be dropped. Because the p-value is greater than 0.05, we are able to conclude that the two interaction terms: Agriculture:Examination and Agriculture:Education can be dropped.

reduced_model <- update(best_model, . ~ . - Agriculture:Examination - Agriculture:Education)
anova(reduced_model, best_model)  
Analysis of Variance Table

Model 1: Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + Agriculture:Infant.Mortality + Examination:Education + 
    Examination:Infant.Mortality + Education:Catholic
Model 2: Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + Agriculture:Examination + Agriculture:Education + 
    Agriculture:Infant.Mortality + Examination:Education + Examination:Infant.Mortality + 
    Education:Catholic
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     37 1609.4                              
2     35 1356.3  2    253.03 3.2647 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Refitting the final model after dropping the two nonsignificant interaction terms gives us the following: \[y=160.178 −1.217⋅Agriculture −3.037⋅Examination −0.501⋅Education +0.179⋅Catholic −3.462⋅Infant.Mortality +0.051⋅(Agriculture×Infant.Mortality) +0.004⋅(Examination×Education) +0.127⋅(Examination×Infant.Mortality) −0.012⋅(Education×Catholic) \]

final_model <- lm(
  Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality +
              Agriculture:Infant.Mortality + Examination:Education +
              Examination:Infant.Mortality + Education:Catholic,
  data = swiss
)
summary(final_model)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education + 
    Catholic + Infant.Mortality + Agriculture:Infant.Mortality + 
    Examination:Education + Examination:Infant.Mortality + Education:Catholic, 
    data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0262  -5.0655   0.3806   3.3900  13.3586 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  160.177723  46.810474   3.422 0.001532 ** 
Agriculture                   -1.216858   0.512381  -2.375 0.022850 *  
Examination                   -3.037307   1.290117  -2.354 0.023977 *  
Education                     -0.501068   0.460578  -1.088 0.283670    
Catholic                       0.178965   0.047934   3.734 0.000633 ***
Infant.Mortality              -3.461777   2.203667  -1.571 0.124715    
Agriculture:Infant.Mortality   0.050713   0.024683   2.055 0.047032 *  
Examination:Education          0.004342   0.011049   0.393 0.696592    
Examination:Infant.Mortality   0.126780   0.062368   2.033 0.049297 *  
Education:Catholic            -0.011606   0.005146  -2.255 0.030126 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.595 on 37 degrees of freedom
Multiple R-squared:  0.7758,    Adjusted R-squared:  0.7213 
F-statistic: 14.22 on 9 and 37 DF,  p-value: 1.466e-09
AIC(final_model)
[1] 321.4526

1.3 Q1.3

An exploration of transformations to improve the fit of the model. (10pts)

Solution

To improve model fit, we perform a search to identify the best-fitting linear regression model for predicting the Fertility variable in the swiss dataset by testing different transformations (identity, log, inverse, B-splines, and orthogonal polynomials) which are applied to the five predictor variables (Agriculture, Examination, Education, Catholic, and Infant.Mortality) based on how the distribution of each variable looks.

  1. We will first impose a log transformation on the variable Agriculture. The density plot of the predictor Agriculture showcases that the distribution is right skewed and a log transformation can help normalize the predictor.
final_model_agriculture <- lm(
  Fertility ~ log(Agriculture + 1) + Examination + Education + Catholic + Infant.Mortality,
  data = swiss)

par(mfrow = c(1, 2))

termplot(final_model, 
         partial.resid = TRUE,
         terms = "Agriculture",
         main = "Original Agriculture",
         col.term = "red",
         col.res = "gray60")

termplot(final_model_agriculture,
         partial.resid = TRUE,
         terms = "log(Agriculture + 1)", 
         main = "Log-Transformed Agriculture",
         col.term = "blue",
         col.res = "gray60")

par(mfrow = c(1, 1))
AIC(final_model, final_model_agriculture)
                        df      AIC
final_model             11 321.4526
final_model_agriculture  7 328.3404

The log-transformed plot adds curvature and introduces more scatter and a less precise fit in the partial residuals. In addition, we are able to see from the AIC scores of the original model and the log-transformed plot (only Agriculture) that the original model has better overall model fit. Therefore, we will explore other transformations.

  1. Next, we see that the distribution for examination appears bimodal and therefore we will impose a polynomial trasnformation to capture its nonlinearity.
final_model_examination <- lm(
  Fertility ~ Agriculture + poly(Examination, degree = 2) + Education + Catholic + Infant.Mortality,
  data = swiss)

par(mfrow = c(1, 2))

termplot(final_model, 
         partial.resid = TRUE,
         terms = "Examination",
         main = "Original Examination",
         col.term = "red",
         col.res = "gray60")

termplot(final_model_examination,
         partial.resid = TRUE,
         terms = "poly(Examination, degree = 2)", 
         main = "Poly-Transformed Examination",
         col.term = "blue",
         col.res = "gray60")

par(mfrow = c(1, 1))
AIC(final_model, final_model_examination)
                        df      AIC
final_model             11 321.4526
final_model_examination  8 326.2338

The plot for the poly-transformed examination shows that the curvature is not very strong as the partial residuals appear scattered and the line does not clearly capture a stronger pattern. We are able to interpret this as the polynomial term overfitting or adding unncessary complexity. In addition, the AIC score for the final model after transforming the examination predictor variable is greater than that of the AIC score for the original model. Therefore, although the Examnation predictor seems to be bimodal, it is better to not transform it.

  1. Next, we see that the distribution for education is right skewed. Therefore, we will impose a log transformation on Education.
final_model_education <- lm(
  Fertility ~ Agriculture + Examination + log(Education + 1) + Catholic + Infant.Mortality,
  data = swiss)

par(mfrow = c(1, 2))

termplot(final_model, 
         partial.resid = TRUE,
         terms = "Education",
         main = "Original Education",
         col.term = "red",
         col.res = "gray60")

termplot(final_model_education,
         partial.resid = TRUE,
         terms = "log(Education + 1)", 
         main = "Log Transformed Education",
         col.term = "blue",
         col.res = "gray60")

par(mfrow = c(1, 1))
AIC(final_model, final_model_education)
                      df      AIC
final_model           11 321.4526
final_model_education  7 340.4162

The termplot for “Log-Transformed Education” shows that the curve is nonlinear, flattening out as Education increases. The residuals appear more scattered and the trend is less distinct. The curvature does not enhenace interpretability of the model fit, despite the Education predictor being right skewed. The AIC score for the model with the log transformed Education predictor is also higher than that of the original model. Therefore, we can conclude that the transformation is not necessary.

  1. Next, we see that the distribution for the variable Catholic has two distinct peaks. Therefore, we will impose B-splines on this predictor variable to capture its nonlinearity and model this multi-modal pattern.
final_model_catholic <- lm(
  Fertility ~ Agriculture + Examination + Education +  bs(Catholic, df = 3) + Infant.Mortality,
  data = swiss)

par(mfrow = c(1, 2))
termplot(final_model, 
         partial.resid = TRUE,
         terms = "Catholic",
         main = "Original Catholic",
         col.term = "red",
         col.res = "gray60")

termplot(final_model_catholic,
         partial.resid = TRUE,
         terms = "bs(Catholic, df = 3)", 
         main = "Spline Transformed Catholic",
         col.term = "blue",
         col.res = "gray60")

par(mfrow = c(1, 1))
AIC(final_model, final_model_catholic)
                     df      AIC
final_model          11 321.4526
final_model_catholic  9 324.3366

We are able to see that the B-spline (df = 3) introduces flexibility, allowing for a nonlinear curve which potentially better reflects multimodal patterns, especially around the extreme values of the Catholic variable. However, when comparing the AIC scores between the original model and the B-spline transformed, model, we see that the original model has a lower score. Therefore, a transformation on the Catholic variable is not necessary.

  1. Lastly, we see that the distribution for the Infant Mortality variable is somewhat normal with a right tail. Therefore, we will impose a squared root transformation on this predictor variable to capture its nonlinearity.
final_model_infant_mortality <- lm(
  Fertility ~ Agriculture + Examination + Education +  Catholic + sqrt(Infant.Mortality),
  data = swiss)

par(mfrow = c(1, 2))

termplot(final_model, 
         partial.resid = TRUE,
         terms = "Infant.Mortality",
         main = "Original Infant Mortality",
         col.term = "red",
         col.res = "gray60")

termplot(final_model_infant_mortality,
         partial.resid = TRUE,
         terms = "sqrt(Infant.Mortality)", 
         main = "Square Root Transformed Infant Mortality",
         col.term = "blue",
         col.res = "gray60")

par(mfrow = c(1, 1))
AIC(final_model, final_model_infant_mortality)
                             df      AIC
final_model                  11 321.4526
final_model_infant_mortality  7 326.2892

The termplot for the square root transformed variable for Infant Mortality introduces nonlinearity, curving upward slightly. We are able to see that the partial residuals are more dispersed and less tightly aligned to the trend. The visual signal appears weaker and less interpretable. In addition, the AIC score for the square root transformed model is greater than that of the original model. Therefore, we can conclude that the square root transformation is not necessary.

After analysis on the effects of these transformations, we conclude that the best final model is characterized by:

final_best_model <- lm(
  Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality +
              Agriculture:Infant.Mortality + Examination:Education +
              Examination:Infant.Mortality + Education:Catholic,
  data = swiss
)
summary(final_best_model)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education + 
    Catholic + Infant.Mortality + Agriculture:Infant.Mortality + 
    Examination:Education + Examination:Infant.Mortality + Education:Catholic, 
    data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0262  -5.0655   0.3806   3.3900  13.3586 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  160.177723  46.810474   3.422 0.001532 ** 
Agriculture                   -1.216858   0.512381  -2.375 0.022850 *  
Examination                   -3.037307   1.290117  -2.354 0.023977 *  
Education                     -0.501068   0.460578  -1.088 0.283670    
Catholic                       0.178965   0.047934   3.734 0.000633 ***
Infant.Mortality              -3.461777   2.203667  -1.571 0.124715    
Agriculture:Infant.Mortality   0.050713   0.024683   2.055 0.047032 *  
Examination:Education          0.004342   0.011049   0.393 0.696592    
Examination:Infant.Mortality   0.126780   0.062368   2.033 0.049297 *  
Education:Catholic            -0.011606   0.005146  -2.255 0.030126 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.595 on 37 degrees of freedom
Multiple R-squared:  0.7758,    Adjusted R-squared:  0.7213 
F-statistic: 14.22 on 9 and 37 DF,  p-value: 1.466e-09
AIC(final_best_model)
[1] 321.4526

\[ Fertility = 160.178+(−1.217)⋅Agriculture+(−3.037)⋅Examination+(−0.501)⋅Education+0.179⋅Catholic+(−3.462)⋅Infant.Mortality + 0.051⋅(Agriculture×Infant.Mortality)+0.004⋅(Examination×Education)+0.127⋅(Examination×Infant.Mortality)+(−0.012)⋅(Education×Catholic) \]

1.4 Q1.4

Diagnostics to check the assumptions of your model. (10pts)

Solution 1. Residuals versus Fitted Values: The residuals appear randomly scattered around 0 (horizontal line), suggesting no severe non-linearity. We are able to see that there is a slight funnel shape (wider spread at higher fitted values) which may indicate mild heteroscedasticity (non-constant variance).

  1. Scale Location (Spread-Location): We will now assess the homoscedasticity (constant variance of residuals). We are able to see that the points are roughly horizontal with minor fluctuations, which suggests moderate homoscedasticity.

  2. Q-Q Residuals: We will test the normality of residuals. In the graph, the points mostly follow the dashed 45 degree line, indicating approximate normality. Minor deviations in the tails are not severe enough to be a problem.

  3. Residuals versus Leverage: We will plot this graph to identify influential outliers (high leverage + high residuals). We are able to see that the no points fall outside Cook’s distance contours (red dashed lines), suggesting that there are no highly influential outliers. There a few points with high leverage but these modest residuals pose little risk.

par(mfrow = c(2, 2))
plot(final_best_model)

par(mfrow = c(1, 1))

The results of the Shapiro Wilk normality test are shown below. The p-value is 0.2238. Therefore, we fail to reject the null hypothesis that the residuals are normally distributed. Therefore, this suggests that the model’s residuals do not significantly deviate from normality.

shapiro.test(residuals(final_best_model)) 

    Shapiro-Wilk normality test

data:  residuals(final_best_model)
W = 0.96809, p-value = 0.2238

Lastly, we will run the Studentized Breusch-Pagan test to check for heteroscedasticity (non-constant variance of residuals) in a linear regression model. Because the residual plots from above hinted at heteroscedastcity, we will run this test to ensure that the residuals have constant variance. The p-value resulting from the Breusch-Pagan Test is 0.2305so we fail to reject the null hypothesis. Therefore, we are able to conclude that the residuals have constant variance.

bptest(final_best_model) 

    studentized Breusch-Pagan test

data:  final_best_model
BP = 11.704, df = 9, p-value = 0.2305

1.5 Q1.5

Some predictions of future observations for interesting values of the predictors.(5pts)

Solution We first created a data frame containing various combinations of the predictor variables. Specifically, we used values of 10, 50, and 90 for Agriculture; a sequence of scores from 3 to 30 (by 5) for Examination; values of 5, 15, and 25 for Education; values of 10, 50, and 90 for Catholic; and values of 10, 20, and 30 for Infant.Mortality. Using this grid, we applied the final_best_model to generate predicted Fertility values for each combination, along with corresponding 95% confidence intervals for the mean predictions. The resulting data frame is displayed below.

new_data <- expand.grid(
  Agriculture = c(10, 50, 90),        
  Examination = seq(3, 30, by = 5),  
  Education = c(5, 15, 25),          
  Catholic = c(10, 50, 90),          
  Infant.Mortality = c(10, 20, 30)) 

predictions <- predict(final_best_model, 
                       newdata = new_data,
                       interval = "confidence")

results <- cbind(new_data, predictions)
results
    Agriculture Examination Education Catholic Infant.Mortality         fit
1            10           3         5       10               10 111.9232757
2            50           3         5       10               10  83.5340539
3            90           3         5       10               10  55.1448320
4            10           8         5       10               10 103.1843077
5            50           8         5       10               10  74.7950858
6            90           8         5       10               10  46.4058639
7            10          13         5       10               10  94.4453396
8            50          13         5       10               10  66.0561178
9            90          13         5       10               10  37.6668959
10           10          18         5       10               10  85.7063716
11           50          18         5       10               10  57.3171497
12           90          18         5       10               10  28.9279278
13           10          23         5       10               10  76.9674035
14           50          23         5       10               10  48.5781817
15           90          23         5       10               10  20.1889598
16           10          28         5       10               10  68.2284355
17           50          28         5       10               10  39.8392136
18           90          28         5       10               10  11.4499917
19           10           3        15       10               10 105.8822603
20           50           3        15       10               10  77.4930384
21           90           3        15       10               10  49.1038165
22           10           8        15       10               10  97.3603911
23           50           8        15       10               10  68.9711692
24           90           8        15       10               10  40.5819473
25           10          13        15       10               10  88.8385219
26           50          13        15       10               10  60.4493000
27           90          13        15       10               10  32.0600781
28           10          18        15       10               10  80.3166527
29           50          18        15       10               10  51.9274308
30           90          18        15       10               10  23.5382089
31           10          23        15       10               10  71.7947835
32           50          23        15       10               10  43.4055616
33           90          23        15       10               10  15.0163397
34           10          28        15       10               10  63.2729143
35           50          28        15       10               10  34.8836924
36           90          28        15       10               10   6.4944705
37           10           3        25       10               10  99.8412448
38           50           3        25       10               10  71.4520230
39           90           3        25       10               10  43.0628011
40           10           8        25       10               10  91.5364745
41           50           8        25       10               10  63.1472526
42           90           8        25       10               10  34.7580307
43           10          13        25       10               10  83.2317041
44           50          13        25       10               10  54.8424822
45           90          13        25       10               10  26.4532603
46           10          18        25       10               10  74.9269338
47           50          18        25       10               10  46.5377119
48           90          18        25       10               10  18.1484900
49           10          23        25       10               10  66.6221634
50           50          23        25       10               10  38.2329415
51           90          23        25       10               10   9.8437196
52           10          28        25       10               10  58.3173931
53           50          28        25       10               10  29.9281712
54           90          28        25       10               10   1.5389493
55           10           3         5       50               10 116.7606935
56           50           3         5       50               10  88.3714716
57           90           3         5       50               10  59.9822497
58           10           8         5       50               10 108.0217254
59           50           8         5       50               10  79.6325035
60           90           8         5       50               10  51.2432816
61           10          13         5       50               10  99.2827574
62           50          13         5       50               10  70.8935355
63           90          13         5       50               10  42.5043136
64           10          18         5       50               10  90.5437893
65           50          18         5       50               10  62.1545674
66           90          18         5       50               10  33.7653455
67           10          23         5       50               10  81.8048213
68           50          23         5       50               10  53.4155994
69           90          23         5       50               10  25.0263775
70           10          28         5       50               10  73.0658532
71           50          28         5       50               10  44.6766313
72           90          28         5       50               10  16.2874094
73           10           3        15       50               10 106.0773167
74           50           3        15       50               10  77.6880948
75           90           3        15       50               10  49.2988729
76           10           8        15       50               10  97.5554475
77           50           8        15       50               10  69.1662256
78           90           8        15       50               10  40.7770037
79           10          13        15       50               10  89.0335783
80           50          13        15       50               10  60.6443564
81           90          13        15       50               10  32.2551345
82           10          18        15       50               10  80.5117091
83           50          18        15       50               10  52.1224872
84           90          18        15       50               10  23.7332653
85           10          23        15       50               10  71.9898399
86           50          23        15       50               10  43.6006180
87           90          23        15       50               10  15.2113961
88           10          28        15       50               10  63.4679707
89           50          28        15       50               10  35.0787488
90           90          28        15       50               10   6.6895269
91           10           3        25       50               10  95.3939400
92           50           3        25       50               10  67.0047181
93           90           3        25       50               10  38.6154962
94           10           8        25       50               10  87.0891696
95           50           8        25       50               10  58.6999477
96           90           8        25       50               10  30.3107258
97           10          13        25       50               10  78.7843993
98           50          13        25       50               10  50.3951774
99           90          13        25       50               10  22.0059555
100          10          18        25       50               10  70.4796289
101          50          18        25       50               10  42.0904070
102          90          18        25       50               10  13.7011851
103          10          23        25       50               10  62.1748585
104          50          23        25       50               10  33.7856366
105          90          23        25       50               10   5.3964148
106          10          28        25       50               10  53.8700882
107          50          28        25       50               10  25.4808663
108          90          28        25       50               10  -2.9083556
109          10           3         5       90               10 121.5981112
110          50           3         5       90               10  93.2088893
111          90           3         5       90               10  64.8196674
112          10           8         5       90               10 112.8591431
113          50           8         5       90               10  84.4699212
114          90           8         5       90               10  56.0806993
115          10          13         5       90               10 104.1201751
116          50          13         5       90               10  75.7309532
117          90          13         5       90               10  47.3417313
118          10          18         5       90               10  95.3812070
119          50          18         5       90               10  66.9919851
120          90          18         5       90               10  38.6027632
121          10          23         5       90               10  86.6422390
122          50          23         5       90               10  58.2530171
123          90          23         5       90               10  29.8637952
124          10          28         5       90               10  77.9032709
125          50          28         5       90               10  49.5140490
126          90          28         5       90               10  21.1248271
127          10           3        15       90               10 106.2723731
128          50           3        15       90               10  77.8831512
129          90           3        15       90               10  49.4939294
130          10           8        15       90               10  97.7505039
131          50           8        15       90               10  69.3612820
132          90           8        15       90               10  40.9720602
133          10          13        15       90               10  89.2286347
134          50          13        15       90               10  60.8394128
135          90          13        15       90               10  32.4501909
136          10          18        15       90               10  80.7067655
137          50          18        15       90               10  52.3175436
138          90          18        15       90               10  23.9283217
139          10          23        15       90               10  72.1848963
140          50          23        15       90               10  43.7956744
141          90          23        15       90               10  15.4064525
142          10          28        15       90               10  63.6630271
143          50          28        15       90               10  35.2738052
144          90          28        15       90               10   6.8845833
145          10           3        25       90               10  90.9466351
146          50           3        25       90               10  62.5574132
147          90           3        25       90               10  34.1681913
148          10           8        25       90               10  82.6418647
149          50           8        25       90               10  54.2526428
150          90           8        25       90               10  25.8634210
151          10          13        25       90               10  74.3370944
152          50          13        25       90               10  45.9478725
153          90          13        25       90               10  17.5586506
154          10          18        25       90               10  66.0323240
155          50          18        25       90               10  37.6431021
156          90          18        25       90               10   9.2538802
157          10          23        25       90               10  57.7275537
158          50          23        25       90               10  29.3383318
159          90          23        25       90               10   0.9491099
160          10          28        25       90               10  49.4227833
161          50          28        25       90               10  21.0335614
162          90          28        25       90               10  -7.3556605
163          10           3         5       10               20  86.1801818
164          50           3         5       10               20  78.0760397
165          90           3         5       10               20  69.9718976
166          10           8         5       10               20  83.7802300
167          50           8         5       10               20  75.6760878
168          90           8         5       10               20  67.5719457
169          10          13         5       10               20  81.3802781
170          50          13         5       10               20  73.2761360
171          90          13         5       10               20  65.1719939
172          10          18         5       10               20  78.9803262
173          50          18         5       10               20  70.8761841
174          90          18         5       10               20  62.7720420
175          10          23         5       10               20  76.5803743
176          50          23         5       10               20  68.4762322
177          90          23         5       10               20  60.3720901
178          10          28         5       10               20  74.1804225
179          50          28         5       10               20  66.0762803
180          90          28         5       10               20  57.9721382
181          10           3        15       10               20  80.1391664
182          50           3        15       10               20  72.0350243
183          90           3        15       10               20  63.9308822
184          10           8        15       10               20  77.9563134
185          50           8        15       10               20  69.8521712
186          90           8        15       10               20  61.7480291
187          10          13        15       10               20  75.7734603
188          50          13        15       10               20  67.6693182
189          90          13        15       10               20  59.5651761
190          10          18        15       10               20  73.5906073
191          50          18        15       10               20  65.4864652
192          90          18        15       10               20  57.3823231
193          10          23        15       10               20  71.4077543
194          50          23        15       10               20  63.3036122
195          90          23        15       10               20  55.1994700
196          10          28        15       10               20  69.2249012
197          50          28        15       10               20  61.1207591
198          90          28        15       10               20  53.0166170
199          10           3        25       10               20  74.0981509
200          50           3        25       10               20  65.9940088
201          90           3        25       10               20  57.8898667
202          10           8        25       10               20  72.1323968
203          50           8        25       10               20  64.0282546
204          90           8        25       10               20  55.9241125
205          10          13        25       10               20  70.1666426
206          50          13        25       10               20  62.0625005
207          90          13        25       10               20  53.9583583
208          10          18        25       10               20  68.2008884
209          50          18        25       10               20  60.0967463
210          90          18        25       10               20  51.9926042
211          10          23        25       10               20  66.2351342
212          50          23        25       10               20  58.1309921
213          90          23        25       10               20  50.0268500
214          10          28        25       10               20  64.2693800
215          50          28        25       10               20  56.1652379
216          90          28        25       10               20  48.0610958
217          10           3         5       50               20  91.0175995
218          50           3         5       50               20  82.9134574
219          90           3         5       50               20  74.8093153
220          10           8         5       50               20  88.6176477
221          50           8         5       50               20  80.5135056
222          90           8         5       50               20  72.4093634
223          10          13         5       50               20  86.2176958
224          50          13         5       50               20  78.1135537
225          90          13         5       50               20  70.0094116
226          10          18         5       50               20  83.8177439
227          50          18         5       50               20  75.7136018
228          90          18         5       50               20  67.6094597
229          10          23         5       50               20  81.4177921
230          50          23         5       50               20  73.3136499
231          90          23         5       50               20  65.2095078
232          10          28         5       50               20  79.0178402
233          50          28         5       50               20  70.9136981
234          90          28         5       50               20  62.8095559
235          10           3        15       50               20  80.3342228
236          50           3        15       50               20  72.2300807
237          90           3        15       50               20  64.1259386
238          10           8        15       50               20  78.1513698
239          50           8        15       50               20  70.0472277
240          90           8        15       50               20  61.9430855
241          10          13        15       50               20  75.9685167
242          50          13        15       50               20  67.8643746
243          90          13        15       50               20  59.7602325
244          10          18        15       50               20  73.7856637
245          50          18        15       50               20  65.6815216
246          90          18        15       50               20  57.5773795
247          10          23        15       50               20  71.6028107
248          50          23        15       50               20  63.4986686
249          90          23        15       50               20  55.3945265
250          10          28        15       50               20  69.4199577
251          50          28        15       50               20  61.3158155
252          90          28        15       50               20  53.2116734
253          10           3        25       50               20  69.6508461
254          50           3        25       50               20  61.5467039
255          90           3        25       50               20  53.4425618
256          10           8        25       50               20  67.6850919
257          50           8        25       50               20  59.5809498
258          90           8        25       50               20  51.4768076
259          10          13        25       50               20  65.7193377
260          50          13        25       50               20  57.6151956
261          90          13        25       50               20  49.5110535
262          10          18        25       50               20  63.7535835
263          50          18        25       50               20  55.6494414
264          90          18        25       50               20  47.5452993
265          10          23        25       50               20  61.7878293
266          50          23        25       50               20  53.6836872
267          90          23        25       50               20  45.5795451
268          10          28        25       50               20  59.8220752
269          50          28        25       50               20  51.7179330
270          90          28        25       50               20  43.6137909
271          10           3         5       90               20  95.8550173
272          50           3         5       90               20  87.7508751
273          90           3         5       90               20  79.6467330
274          10           8         5       90               20  93.4550654
275          50           8         5       90               20  85.3509233
276          90           8         5       90               20  77.2467812
277          10          13         5       90               20  91.0551135
278          50          13         5       90               20  82.9509714
279          90          13         5       90               20  74.8468293
280          10          18         5       90               20  88.6551616
281          50          18         5       90               20  80.5510195
282          90          18         5       90               20  72.4468774
283          10          23         5       90               20  86.2552098
284          50          23         5       90               20  78.1510676
285          90          23         5       90               20  70.0469255
286          10          28         5       90               20  83.8552579
287          50          28         5       90               20  75.7511158
288          90          28         5       90               20  67.6469737
289          10           3        15       90               20  80.5292792
290          50           3        15       90               20  72.4251371
291          90           3        15       90               20  64.3209950
292          10           8        15       90               20  78.3464262
293          50           8        15       90               20  70.2422841
294          90           8        15       90               20  62.1381420
295          10          13        15       90               20  76.1635732
296          50          13        15       90               20  68.0594311
297          90          13        15       90               20  59.9552889
298          10          18        15       90               20  73.9807201
299          50          18        15       90               20  65.8765780
300          90          18        15       90               20  57.7724359
301          10          23        15       90               20  71.7978671
302          50          23        15       90               20  63.6937250
303          90          23        15       90               20  55.5895829
304          10          28        15       90               20  69.6150141
305          50          28        15       90               20  61.5108720
306          90          28        15       90               20  53.4067298
307          10           3        25       90               20  65.2035412
308          50           3        25       90               20  57.0993991
309          90           3        25       90               20  48.9952570
310          10           8        25       90               20  63.2377870
311          50           8        25       90               20  55.1336449
312          90           8        25       90               20  47.0295028
313          10          13        25       90               20  61.2720328
314          50          13        25       90               20  53.1678907
315          90          13        25       90               20  45.0637486
316          10          18        25       90               20  59.3062786
317          50          18        25       90               20  51.2021365
318          90          18        25       90               20  43.0979944
319          10          23        25       90               20  57.3405245
320          50          23        25       90               20  49.2363823
321          90          23        25       90               20  41.1322402
322          10          28        25       90               20  55.3747703
323          50          28        25       90               20  47.2706282
324          90          28        25       90               20  39.1664860
325          10           3         5       10               30  60.4370879
326          50           3         5       10               30  72.6180256
327          90           3         5       10               30  84.7989632
328          10           8         5       10               30  64.3761522
329          50           8         5       10               30  76.5570899
330          90           8         5       10               30  88.7380275
331          10          13         5       10               30  68.3152165
332          50          13         5       10               30  80.4961542
333          90          13         5       10               30  92.6770918
334          10          18         5       10               30  72.2542808
335          50          18         5       10               30  84.4352185
336          90          18         5       10               30  96.6161561
337          10          23         5       10               30  76.1933451
338          50          23         5       10               30  88.3742828
339          90          23         5       10               30 100.5552204
340          10          28         5       10               30  80.1324094
341          50          28         5       10               30  92.3133471
342          90          28         5       10               30 104.4942847
343          10           3        15       10               30  54.3960725
344          50           3        15       10               30  66.5770101
345          90           3        15       10               30  78.7579478
346          10           8        15       10               30  58.5522356
347          50           8        15       10               30  70.7331733
348          90           8        15       10               30  82.9141109
349          10          13        15       10               30  62.7083988
350          50          13        15       10               30  74.8893364
351          90          13        15       10               30  87.0702741
352          10          18        15       10               30  66.8645619
353          50          18        15       10               30  79.0454996
354          90          18        15       10               30  91.2264372
355          10          23        15       10               30  71.0207251
356          50          23        15       10               30  83.2016627
357          90          23        15       10               30  95.3826004
358          10          28        15       10               30  75.1768882
359          50          28        15       10               30  87.3578259
360          90          28        15       10               30  99.5387635
361          10           3        25       10               30  48.3550570
362          50           3        25       10               30  60.5359947
363          90           3        25       10               30  72.7169323
364          10           8        25       10               30  52.7283190
365          50           8        25       10               30  64.9092567
366          90           8        25       10               30  77.0901943
367          10          13        25       10               30  57.1015810
368          50          13        25       10               30  69.2825187
369          90          13        25       10               30  81.4634563
370          10          18        25       10               30  61.4748430
371          50          18        25       10               30  73.6557807
372          90          18        25       10               30  85.8367183
373          10          23        25       10               30  65.8481050
374          50          23        25       10               30  78.0290427
375          90          23        25       10               30  90.2099803
376          10          28        25       10               30  70.2213670
377          50          28        25       10               30  82.4023047
378          90          28        25       10               30  94.5832423
379          10           3         5       50               30  65.2745056
380          50           3         5       50               30  77.4554433
381          90           3         5       50               30  89.6363809
382          10           8         5       50               30  69.2135699
383          50           8         5       50               30  81.3945076
384          90           8         5       50               30  93.5754452
385          10          13         5       50               30  73.1526342
386          50          13         5       50               30  85.3335719
387          90          13         5       50               30  97.5145096
388          10          18         5       50               30  77.0916985
389          50          18         5       50               30  89.2726362
390          90          18         5       50               30 101.4535739
391          10          23         5       50               30  81.0307628
392          50          23         5       50               30  93.2117005
393          90          23         5       50               30 105.3926382
394          10          28         5       50               30  84.9698271
395          50          28         5       50               30  97.1507648
396          90          28         5       50               30 109.3317025
397          10           3        15       50               30  54.5911289
398          50           3        15       50               30  66.7720665
399          90           3        15       50               30  78.9530042
400          10           8        15       50               30  58.7472920
401          50           8        15       50               30  70.9282297
402          90           8        15       50               30  83.1091674
403          10          13        15       50               30  62.9034552
404          50          13        15       50               30  75.0843928
405          90          13        15       50               30  87.2653305
406          10          18        15       50               30  67.0596183
407          50          18        15       50               30  79.2405560
408          90          18        15       50               30  91.4214936
409          10          23        15       50               30  71.2157815
410          50          23        15       50               30  83.3967191
411          90          23        15       50               30  95.5776568
412          10          28        15       50               30  75.3719446
413          50          28        15       50               30  87.5528823
414          90          28        15       50               30  99.7338199
415          10           3        25       50               30  43.9077522
416          50           3        25       50               30  56.0886898
417          90           3        25       50               30  68.2696275
418          10           8        25       50               30  48.2810141
419          50           8        25       50               30  60.4619518
420          90           8        25       50               30  72.6428895
421          10          13        25       50               30  52.6542761
422          50          13        25       50               30  64.8352138
423          90          13        25       50               30  77.0161514
424          10          18        25       50               30  57.0275381
425          50          18        25       50               30  69.2084758
426          90          18        25       50               30  81.3894134
427          10          23        25       50               30  61.4008001
428          50          23        25       50               30  73.5817378
429          90          23        25       50               30  85.7626754
430          10          28        25       50               30  65.7740621
431          50          28        25       50               30  77.9549998
432          90          28        25       50               30  90.1359374
433          10           3         5       90               30  70.1119234
434          50           3         5       90               30  82.2928610
435          90           3         5       90               30  94.4737987
436          10           8         5       90               30  74.0509877
437          50           8         5       90               30  86.2319253
438          90           8         5       90               30  98.4128630
439          10          13         5       90               30  77.9900520
440          50          13         5       90               30  90.1709896
441          90          13         5       90               30 102.3519273
442          10          18         5       90               30  81.9291163
443          50          18         5       90               30  94.1100539
444          90          18         5       90               30 106.2909916
445          10          23         5       90               30  85.8681806
446          50          23         5       90               30  98.0491182
447          90          23         5       90               30 110.2300559
448          10          28         5       90               30  89.8072449
449          50          28         5       90               30 101.9881825
450          90          28         5       90               30 114.1691202
451          10           3        15       90               30  54.7861853
452          50           3        15       90               30  66.9671230
453          90           3        15       90               30  79.1480606
454          10           8        15       90               30  58.9423485
455          50           8        15       90               30  71.1232861
456          90           8        15       90               30  83.3042238
457          10          13        15       90               30  63.0985116
458          50          13        15       90               30  75.2794493
459          90          13        15       90               30  87.4603869
460          10          18        15       90               30  67.2546748
461          50          18        15       90               30  79.4356124
462          90          18        15       90               30  91.6165501
463          10          23        15       90               30  71.4108379
464          50          23        15       90               30  83.5917756
465          90          23        15       90               30  95.7727132
466          10          28        15       90               30  75.5670011
467          50          28        15       90               30  87.7479387
468          90          28        15       90               30  99.9288764
469          10           3        25       90               30  39.4604473
470          50           3        25       90               30  51.6413849
471          90           3        25       90               30  63.8223226
472          10           8        25       90               30  43.8337093
473          50           8        25       90               30  56.0146469
474          90           8        25       90               30  68.1955846
475          10          13        25       90               30  48.2069713
476          50          13        25       90               30  60.3879089
477          90          13        25       90               30  72.5688466
478          10          18        25       90               30  52.5802333
479          50          18        25       90               30  64.7611709
480          90          18        25       90               30  76.9421086
481          10          23        25       90               30  56.9534953
482          50          23        25       90               30  69.1344329
483          90          23        25       90               30  81.3153706
484          10          28        25       90               30  61.3267572
485          50          28        25       90               30  73.5076949
486          90          28        25       90               30  85.6886326
             lwr       upr
1    70.42601667 153.42053
2    61.29870975 105.76940
3    40.21485236  70.07481
4    68.08989456 138.27872
5    58.83396275  90.75621
6    31.10871344  61.70301
7    65.47961667 123.41106
8    55.43047316  76.68176
9    19.14930190  56.18449
10   62.37809833 109.03464
11   49.01949044  65.61481
12    5.48438473  52.37147
13   58.33400579  95.60080
14   37.50578681  59.65058
15   -9.03599688  49.41392
16   52.48046284  83.97641
17   23.28189926  56.39653
18  -23.99569097  46.89567
19   64.20607363 147.55845
20   54.63801014 100.34807
21   32.85863735  65.34900
22   62.28340465 132.43738
23   52.61182390  85.33051
24   24.42747032  56.73642
25   60.13299401 117.54405
26   49.81793362  71.08067
27   13.13521831  50.98494
28   57.56271688 103.07059
29   44.27591892  59.57894
30   -0.02989285  47.10631
31   54.14265134  89.44692
32   33.19834710  53.61278
33  -14.18781126  44.22049
34   48.93499761  77.61083
35   19.07386215  50.69352
36  -28.86696835  41.85591
37   56.42804993 143.25444
38   45.31367870  97.59037
39   22.13850096  63.98710
40   54.98686186 128.08609
41   43.46982978  82.82468
42   14.89264620  54.62342
43   53.32353377 113.13987
44   40.74834593  68.93662
45    5.03139185  47.87513
46   51.25038618  98.60348
47   35.69593302  57.37949
48   -6.96356247  43.26054
49   48.34335221  84.90097
50   26.23713224  50.22875
51  -20.31900055  40.00644
52   43.65087496  72.98391
53   13.26352317  46.59282
54  -34.46688331  37.54478
55   75.40327304 158.11811
56   66.42311604 110.31983
57   45.52091695  74.44358
58   72.99471703 143.04873
59   63.85476595  95.41024
60   36.17382788  66.31274
61   70.28023401 128.28528
62   60.21909652  81.56797
63   23.98819410  61.02043
64   67.02355637 114.06402
65   53.39384579  70.91529
66   10.17710955  57.35358
67   62.75078478 100.85886
68   41.69566796  65.13553
69   -4.43144025  54.48420
70   56.61254937  89.51916
71   27.47886131  61.87440
72  -19.44662374  52.02144
73   64.47625912 147.67837
74   55.02383401 100.35236
75   33.39933865  65.19841
76   62.48135330 132.62954
77   52.88745943  85.44499
78   24.77981272  56.77419
79   60.22632077 117.84084
80   49.85317251  71.43554
81   13.30396146  51.20631
82   57.49766992 103.52575
83   43.87488258  60.37009
84    0.01606245  47.45047
85   53.83616745  90.14351
86   32.65903563  54.54220
87  -14.21625030  44.63904
88   48.31582373  78.62012
89   18.60041823  51.55708
90  -28.94176476  42.32082
91   52.53301216 138.25487
92   41.86895677  92.14048
93   19.05380896  58.17718
94   51.07126189 123.10708
95   40.12964136  77.27025
96   11.64180439  48.97965
97   49.37392491 108.19487
98   37.53586783  63.25449
99    1.46688467  42.54503
100  47.23918771  93.72007
101  32.43001438  51.75080
102 -10.84741997  38.24979
103  44.21056080  80.13916
104  22.43925789  45.13202
105 -24.45103698  35.24387
106  39.28346208  68.45671
107   9.00017404  41.96156
108 -38.77703999  32.96033
109  80.19434279 163.00188
110  71.19816024 115.21962
111  50.30146857  79.33787
112  77.67978982 148.03850
113  68.39078396 100.54906
114  40.73155737  71.42984
115  74.81656244 133.42379
116  64.31144925  87.15046
117  28.41518449  66.26828
118  71.34755394 119.41486
119  56.98841393  76.99556
120  14.54798949  62.65754
121  66.78446468 106.50001
122  45.30937942  71.19665
123  -0.08365978  59.81125
124  60.32925968  95.47728
125  31.27113185  67.75697
126 -15.10869229  57.35835
127  64.29051123 148.25424
128  54.57928038 101.18702
129  32.76760840  66.22025
130  62.14047066 133.36054
131  52.02768585  86.69488
132  23.97387226  57.97025
133  59.66862592 118.78864
134  48.27125357  73.40757
135  12.49451804  52.40586
136  56.63114128 104.78239
137  41.56646887  63.06862
138  -0.72205922  48.57870
139  52.54821749  91.82158
140  30.63001834  56.96133
141 -14.87845447  45.69136
142  46.58425260  80.74180
143  17.07475847  53.47285
144 -29.54111933  43.31029
145  47.13168980 134.76158
146  35.89776934  89.21706
147  12.77123971  55.56514
148  45.37201327 119.91172
149  33.44763752  75.05765
150   5.06052664  46.66632
151  43.25993224 105.41426
152  29.74862364  62.14712
153  -5.13820088  40.25550
154  40.53734479  91.52730
155  23.52081970  51.76538
156 -17.29775819  35.80552
157  36.71199500  78.74311
158  13.72046253  44.95620
159 -30.70981809  32.60804
160  30.96284757  67.88272
161   1.13751659  40.92961
162 -44.86621524  30.15489
163  75.28800575  97.07236
164  70.28992530  85.86215
165  61.81452243  78.12927
166  74.66072600  92.89973
167  70.09288591  81.25929
168  61.00581679  74.13807
169  73.56510046  89.19546
170  69.38942706  77.16284
171  59.42124390  70.92274
172  71.74330705  86.21735
173  67.35103661  74.40133
174  56.73764084  68.80644
175  69.02672040  84.13403
176  63.66659402  73.28587
177  53.08218849  67.66199
178  65.51285502  82.84799
179  59.20428130  72.94828
180  48.84760371  67.09667
181  67.95938777  92.31895
182  61.26052721  82.80952
183  51.77490912  76.08686
184  68.10031989  87.81231
185  61.48118513  78.22316
186  51.40678530  72.08927
187  68.04644531  83.50048
188  61.51874229  73.81989
189  50.62719416  68.50316
190  67.58681940  79.59440
191  61.08665680  69.88627
192  49.22105647  65.54359
193  66.29338911  76.52212
194  59.47482678  67.13240
195  47.00815892  63.39078
196  63.74397550  74.70583
197  56.24906177  65.99246
198  43.99655952  62.03667
199  56.46371167  91.73259
200  48.54479882  83.44322
201  38.86589433  76.91384
202  57.37374090  86.89105
203  49.34231804  78.71419
204  39.26666196  72.58156
205  58.17357705  82.15971
206  49.97739550  74.14761
207  39.39123725  68.52548
208  58.76581859  77.63596
209  50.31952718  69.87397
210  39.10452447  64.88068
211  58.92906227  73.54121
212  50.11167242  66.15031
213  38.22958746  61.82411
214  58.19417998  70.34458
215  48.94054508  63.38993
216  36.59726435  59.52493
217  80.89192563 101.14327
218  76.61243692  89.21448
219  68.42092267  81.19771
220  79.98697427  97.24832
221  76.30817273  84.71884
222  67.42777345  77.39095
223  78.48247952  93.95291
224  75.10480226  81.12231
225  65.29701162  74.72181
226  76.16500489  91.47048
227  71.99794004  79.42926
228  61.86637828  73.35254
229  73.01058950  89.82499
230  67.66161526  78.96568
231  57.65007324  72.76894
232  69.21052193  88.82516
233  62.94685139  78.88054
234  53.07852751  72.54058
235  69.50058087  91.16786
236  63.67912628  80.78104
237  54.49240553  73.75947
238  69.34370135  86.95904
239  63.84109614  76.25336
240  53.98116192  69.90501
241  68.78834944  83.14868
242  63.65891750  72.06983
243  52.84994666  66.67052
244  67.51612338  80.05520
245  62.42142577  68.94162
246  50.80357498  64.35118
247  65.21281667  77.99280
248  59.33415972  67.66318
249  47.79259523  62.99646
250  61.92799764  76.91192
251  55.16515728  67.46647
252  44.07565443  62.34769
253  54.87229086  84.42940
254  47.66383126  75.42958
255  38.24395242  68.64117
256  55.66600928  79.70417
257  48.48391698  70.67798
258  38.60183035  64.35178
259  56.24926786  75.18941
260  49.08728009  66.14311
261  38.57014051  60.45197
262  56.39985505  71.10731
263  49.20911749  62.08977
264  37.91133147  57.17927
265  55.65238758  67.92327
266  48.26137912  59.10600
267  36.35488691  54.80420
268  53.46745540  66.17669
269  45.67671613  57.75915
270  33.78798178  53.43960
271  85.76337994 105.94665
272  81.90243600  93.59931
273  74.12383063  85.16964
274  84.44208620 102.46804
275  80.92152108  89.78033
276  72.56270945  81.93085
277  82.45207685  99.65815
278  78.65902971  87.24291
279  69.67339707  80.02026
280  79.70102950  97.60929
281  75.01859516  86.08344
282  65.74060644  79.15315
283  76.26889307  96.24153
284  70.65570431  85.64643
285  61.29652348  78.79733
286  72.33734332  95.37317
287  65.99719246  85.50504
288  56.62196581  78.67198
289  69.37249357  91.68606
290  64.16480213  80.68547
291  55.60735537  73.03463
292  68.54363490  88.14922
293  63.53080089  76.95377
294  54.52707043  69.74921
295  67.13649600  85.19065
296  62.09055903  74.02830
297  52.62713376  67.28344
298  64.99995841  82.96148
299  59.55333408  72.19982
300  49.81958355  65.72529
301  62.12354057  81.47219
302  56.07060565  71.31684
303  46.28547535  64.89369
304  58.64671152  80.58332
305  52.02335189  70.99839
306  42.28658924  64.52687
307  49.19808592  81.20900
308  42.56220412  71.63659
309  33.81564597  64.17487
310  49.04499204  77.43058
311  42.44710881  67.82018
312  33.45025094  60.60875
313  48.39307777  74.15099
314  41.77765007  64.55813
315  32.51214334  57.61535
316  47.08045511  71.53210
317  40.35325688  62.05102
318  30.85623261  55.33976
319  45.00175376  69.67930
320  38.06364675  60.40912
321  28.42988245  53.83460
322  42.17662696  68.57291
323  34.97700887  59.56425
324  25.30970873  53.02326
325  26.03761043  94.83657
326  52.24436486  92.99169
327  64.32030044 105.27763
328  35.58746979  93.16483
329  62.08237018  91.03181
330  69.69139098 107.78466
331  44.58937666  92.04106
332  70.87263995  90.11967
333  72.80660733 112.54758
334  52.61488738  91.89367
335  76.41215181  92.45829
336  73.91018772 119.32212
337  58.95504100  93.43165
338  77.20332813  99.54524
339  73.63034521 127.48010
340  62.89072911  97.37409
341  75.76421074 108.86248
342  72.50991564 136.47865
343  19.35528135  89.43686
344  44.25937854  88.89464
345  55.50232704 102.01357
346  29.26125405  87.84322
347  54.16666157  87.29968
348  61.38588298 104.44234
349  38.72547206  86.69133
350  63.14469703  86.63398
351  65.31665565 108.82389
352  47.38342849  86.34570
353  69.63601434  88.45498
354  67.34988294 115.10299
355  54.55977251  87.48168
356  71.96534288  94.43798
357  67.92218545 122.84302
358  59.38049711  90.97328
359  71.51213344 103.20352
360  67.52043214 131.55709
361  10.86760624  85.84251
362  33.81547370  87.25652
363  44.51779944 100.91607
364  21.20313818  84.25350
365  43.78857556  86.02994
366  51.18928905 102.99110
367  31.16043283  83.04273
368  52.88867280  85.67636
369  56.22260105 106.70431
370  40.43613121  82.51355
371  60.16462519  87.14694
372  59.49430504 112.17913
373  48.44437836  83.25183
374  64.39663817  91.66145
375  61.20441563 119.21554
376  54.29245961  86.15027
377  65.66158452  99.14302
378  61.73052689 127.43596
379  31.17626692  99.37274
380  57.81047367  97.10041
381  70.09956037 109.17320
382  40.64503663  97.78210
383  67.66995596  95.11906
384  75.32460638 111.82628
385  49.52435958  96.78091
386  76.41926539  94.24788
387  78.19710970 116.83191
388  57.36626603  96.81713
389  81.57656838  96.96870
390  79.05058719 123.85656
391  63.46508411  98.59644
392  81.91315563 104.51025
393  78.57307920 132.21220
394  67.17442023 102.76523
395  80.27629210 114.02524
396  77.31066671 141.35274
397  20.35437721  88.82788
398  46.25820910  87.28592
399  57.92899153  99.97702
400  30.12853375  87.36605
401  56.30070153  85.55576
402  63.57910300 102.63923
403  39.38694821  86.41996
404  65.37741462  84.79137
405  67.06750063 107.46316
406  47.71708778  86.40215
407  71.39193280  87.08918
408  68.58302034 114.25997
409  54.41234464  88.01922
410  72.62216767  94.17127
411  68.70101251 122.45430
412  58.70784836  92.03604
413  71.49998080 103.60578
414  67.94975527 131.51788
415   8.30129461  79.51421
416  32.80970283  79.36768
417  44.00998545  92.52927
418  18.52026178  78.04177
419  43.04464241  77.87926
420  50.42689227  94.85889
421  28.27881663  77.02974
422  52.36957380  77.30085
423  54.89944025  99.13286
424  37.19822262  76.85685
425  59.30899212  79.10796
426  57.40347215 105.37535
427  44.58358979  78.21801
428  62.12811584  85.03536
429  58.33859975 113.18675
430  49.55637253  81.99175
431  61.98787052  93.92213
432  58.20769668 122.06418
433  36.09019772 104.13365
434  63.00074575 101.58498
435  75.51539192 113.43221
436  45.43357943 102.66840
437  72.71781327  99.74604
438  80.55694030 116.26879
439  54.13429063 101.84581
440  81.12998337  99.21200
441  83.19703706 121.50682
442  61.73253981 102.12569
443  85.77270299 102.44740
444  83.84887223 128.73311
445  67.55509485 104.18127
446  85.96830270 110.12993
447  83.22911740 137.23099
448  71.05892298 108.55557
449  84.35174741 119.62462
450  81.87181854 146.46642
451  20.80768377  88.76469
452  47.43137897  86.50287
453  59.62167848  98.67444
454  30.33832200  87.54637
455  57.28258064  84.96399
456  64.93393509 101.67451
457  39.24569824  86.95132
458  65.86671448  84.69218
459  67.94705747 106.97372
460  47.08445672  87.42489
461  70.90633121  87.96489
462  69.00710161 114.22600
463  53.19454578  89.62713
464  71.59126253  95.59229
465  68.77791915 122.76751
466  57.02128684  94.11272
467  70.35855684 105.13732
468  67.78287712 132.07488
469   3.95456989  74.96632
470  29.27683964  74.00593
471  41.19035354  86.45429
472  13.70225736  73.96516
473  38.95654561  73.07275
474  47.06312551  89.32804
475  22.79595905  73.61798
476  47.25920612  73.51661
477  50.85401956  94.28367
478  30.80610246  74.35436
479  52.74763953  76.77470
480  52.71269091 101.17153
481  37.12714702  76.77984
482  54.75162658  83.51724
483  53.15203037 109.47871
484  41.26102828  81.39249
485  54.53446688  92.48092
486  52.67559175 118.70167

1.6 Q1.6

An interpretation of the meaning of the model by writing a scientific abstract. (<150 words) (10pts)

  • BACKGROUND: brief intro of the study background, what are the existing findings: In the late 19th century, Switzerland underwent a demographic transition marked by declining fertility. Understanding how cultural, religious, and economic factors influenced fertility offers insights into population change and informs modern demographic strategies.

  • OBJECTIVE: state the overall purpose of your research, e.g., what kind of knowledge gap you are trying to fill in:

This study examines how socioeconomic and cultural variables—Agriculture, Examination, Education, Catholicism, and Infant Mortality—are associated with fertility rates.

  • METHODS: study design (how these data were collected), outcome definitions, statistical procedures used

Using the swiss dataset (47 French-speaking provinces, 1888), we performed exploratory analysis, built a full linear model with two-way interactions, applied stepwise AIC selection, and evaluated predictor transformations (log, polynomial, B-splines). Model assumptions were checked via residual plots, Shapiro-Wilk, and Breusch-Pagan tests.

  • RESULTS: summary of major findings to address the question raised in objective

The final model (Adjusted R² = 0.72; AIC = 321.45) retained predictors in original scale, including four key interaction terms. Transformations did not improve fit.

  • CONCLUSIONS: Multivariable modeling reveals that fertility is influenced by complex interactions among education, religion, and health—offering insight for demographic and policy planning.

2 Q2.(70pts)

The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the the dataset pima.

data(pima)
str(pima)
'data.frame':   768 obs. of  9 variables:
 $ pregnant : int  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose  : int  148 85 183 89 137 116 78 115 197 125 ...
 $ diastolic: int  72 66 64 66 40 74 50 0 70 96 ...
 $ triceps  : int  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin  : int  0 0 0 94 168 0 88 0 543 0 ...
 $ bmi      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ diabetes : num  0.627 0.351 0.672 0.167 2.288 ...
 $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
 $ test     : int  1 0 1 0 1 0 1 0 1 1 ...
summary(pima)
    pregnant         glucose        diastolic         triceps     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    insulin           bmi           diabetes           age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
      test      
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

2.1 Q2.1

Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot? (5pts)

Solution

pima$test <- factor(pima$test, levels = c(0,1), labels = c("Negative", "Positive"))

Producing an Interleaved Histogram:

ggplot(pima, aes(x = insulin, fill = test)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs( 
    title = "Insulin Levels by Test Result",
    x = "Insulin",
    y = "Count",
    fill = "Test Result"
  ) +
  theme_minimal()

Abnormalities of the graph: The histogram displays a large concentration of insulin values at 0, which is highly improbable for human insulin levels. This anomaly likely indicates missing or unrecorded measurements that were incorrectly entered as zero, rather than representing true physiological data. Additionally, the graph reveals the presence of extreme outliers, with some insulin values exceeding 750. Such values are not biologically plausible and may be the result of data entry errors or measurement inaccuracies.

2.2 Q2.2

Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution. (5pts)

pima$insulin_clean <- ifelse(pima$insulin == 0, NA, pima$insulin)
ggplot(pima, aes(x = insulin_clean, fill = test)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.7, na.rm = TRUE) +
  labs(
    title = "Distribution of Insulin Levels by Diabetes Test Result (Excluding Zeros)",
    x = "Insulin",
    y = "Count",
    fill = "Test Result"
  ) +
  theme_minimal() 

Creating the interleaved histogram plot: Amongst the Pima Indians who tested negative (pink), the majority of individuals have insulin levels clustered between 0-150. There is a peak at around insulin levels of 100-125. In addition, the distribution of insulin levels is right skewed where there are fewer people with higher insulin levels. There are a couple of outliers of individuals whose insulin levels are above 300. Amongst the Pima Indians who tested positive (blue), the majority of individuals have insulin levels clustered between 100-250. The distribution is more spread out compard to the distribution corresponding to the inuslin levels of the Pima Indians who tested negative. Compared to the negative group, the positive group has a longer right tail, which suggests that some individuals who tested positive have very high insulin levels of above 500. We are able to see that the spike at insulin = 0 that we saw in the previous graph is gone because these 0 values were replaced with NA’s which were omitted in the plot. The histogram, therefore, reflects only a fraction of the data which may lead to bias interpretations.

2.3 Q2.3

Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame. (10pts)

Solution We observe from the output of summary(pima) that several predictor variables have a minimum value of 0, specifically: glucose, diastolic, triceps, insulin, and bmi. These values are medically implausible and likely represent missing data, so we will convert them to N/A. However, we will not convert the 0 values in the pregnant variable, as it is perfectly reasonable for individuals in the dataset to have had zero pregnancies and is therefore not indicative of missing data.

vars_to_clean <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

pima_clean <- pima
pima_clean[vars_to_clean] <- lapply(pima[vars_to_clean], function(x) ifelse(x == 0, NA, x))
full_model <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age,
                 family = binomial,
                 data = pima_clean)

summary(full_model)

Call:
glm(formula = test ~ pregnant + glucose + diastolic + triceps + 
    insulin + bmi + diabetes + age, family = binomial, data = pima_clean)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
pregnant     8.216e-02  5.543e-02   1.482  0.13825    
glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
diastolic   -1.420e-03  1.183e-02  -0.120  0.90446    
triceps      1.122e-02  1.708e-02   0.657  0.51128    
insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
bmi          7.054e-02  2.734e-02   2.580  0.00989 ** 
diabetes     1.141e+00  4.274e-01   2.669  0.00760 ** 
age          3.395e-02  1.838e-02   1.847  0.06474 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 344.02  on 383  degrees of freedom
  (376 observations deleted due to missingness)
AIC: 362.02

Number of Fisher Scoring iterations: 5
n_used <- nobs(full_model)
n_total <- nrow(pima_clean)

cat("Observations used in model fitting:", n_used, "\n",
    "Total observations in data frame:", n_total, "\n",
    "Observations omitted due to missing values:", n_total - n_used, "\n")
Observations used in model fitting: 392 
 Total observations in data frame: 768 
 Observations omitted due to missing values: 376 

The number of observations used in the model fitting is 392. This is fewer than the 768 observations in the original data frame because the glm() function uses only rows with no missing values in any predictor (NA’s are excluded). The original dataset included rows with invalid zero values in variables such as glucose, diastolic, triceps, insulin, and bmi, which are not physiologically realistic. These zeroes were replaced with NA, and the rows containing them were excluded during model fitting to ensure data quality. It is important to note that the number of observations that were omitted due to missing values is 376 which makes up a significant portion of the dataset. Using such data may cause bias interpretation, as mentioned below and so it is important to use imputation methods in later interpretations.

2.4 Q2.4

Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question. (10pts)

model_reduced <- glm(test ~ pregnant + glucose + diastolic + bmi + diabetes + age,
                     family = binomial(link = "logit"),
                     data = pima_clean) 

summary(model_reduced)

Call:
glm(formula = test ~ pregnant + glucose + diastolic + bmi + diabetes + 
    age, family = binomial(link = "logit"), data = pima_clean)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.962146   0.820892 -10.918  < 2e-16 ***
pregnant     0.117863   0.033418   3.527  0.00042 ***
glucose      0.035194   0.003605   9.763  < 2e-16 ***
diastolic   -0.008916   0.008618  -1.035  0.30084    
bmi          0.090926   0.015740   5.777 7.61e-09 ***
diabetes     0.960515   0.306415   3.135  0.00172 ** 
age          0.016944   0.009834   1.723  0.08489 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 931.94  on 723  degrees of freedom
Residual deviance: 672.86  on 717  degrees of freedom
  (44 observations deleted due to missingness)
AIC: 686.86

Number of Fisher Scoring iterations: 5
n_used_reduced <- nobs(model_reduced)
cat("Observations used in reduced model that removed insulin and triceps:", n_used_reduced, "\n")
Observations used in reduced model that removed insulin and triceps: 724 
cat("Observations used in Full Model:", n_used, "\n") 
Observations used in Full Model: 392 
cat("Total observations in data frame:", n_total, "\n")
Total observations in data frame: 768 

We refit the model without the insulin and triceps predictors to create a reduced model. The number of observations that were used in this reduced model is 724 which is greater than the number of observations that were used in the full model (392 observations). It is possible that the reduced model has more observations than that of the full model because the reduced model did not drop the rows where there were NA values in the insulin and triceps columns.

update_nested <- function(object, formula., ..., evaluate = TRUE) {
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

full_model <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age,
                  family = binomial, data = pima_clean)

reduced_model <- update_nested(full_model, . ~ . - insulin - triceps)

lrt_result <- lrtest(full_model, reduced_model)
print(lrt_result)
Likelihood ratio test

Model 1: test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + 
    diabetes + age
Model 2: test ~ pregnant + glucose + diastolic + bmi + diabetes + age
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   9 -172.01                     
2   7 -172.44 -2 0.8593     0.6507

Test to compare this model to the previous model (that included insulin and triceps): Because the reduced and full models contain a different number of observations, we will use the update_nested function to ensure that the reduced model is fitted on the exact same 392 observations as the full model, which makes the comparison valid. We use the likelihood ratio test to compare the two models. The likelihood ratio test compares the goodness of fit of two models and checks whether removing the two predictors - insulin and triceps significantly worsens the model’s fit where the null hypothesis is that the reduced model fits the data just as well as the full model. The p-value is 0.6507 (larger than 0.05) so we fail to reject the null hypothesis. Therefore, we have sufficient evidence to conclude the reduced model (without the 2 predictors insulin and triceps) fits the data just as well as the full model and do not significantly contribute to the model when the other predictors are already included in the model. Therefore, the simpler model (reduced model without the insulin and triceps predictors) is preferred.

2.5 Q2.5

Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model? (10pts)

Solution

We perform step-wise selection, adding or dropping predictors to minimize the AIC. We start with the full model that contains all the predictors (pregnant, glucose, diastolic, tricpes, insulin, bmi, diabetes, and age) after removing all the rows that contain NA values. The model selection process of AIC balances the goodness of fit (how well the model explains the data) and the model complexity (penalizes the number of predictors). The lower the AIC, the better the model. Once we start with the full model, we iteratively add or remove predictors (both forward seelctin and backward elimination).

vars_to_clean <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima_clean <- pima
pima_clean[vars_to_clean] <- lapply(pima[vars_to_clean], function(x) ifelse(x == 0, NA, x))
pima_complete <- na.omit(pima_clean)
n_complete <- nrow(pima_complete)
cat("Number of complete cases:", n_complete, "\n")
Number of complete cases: 392 
full_model <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + 
                  bmi + diabetes + age,
                  family = binomial,
                  data = pima_complete)

selected_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(selected_model)

Call:
glm(formula = test ~ pregnant + glucose + bmi + diabetes + age, 
    family = binomial, data = pima_complete)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.992080   1.086866  -9.193  < 2e-16 ***
pregnant     0.083953   0.055031   1.526 0.127117    
glucose      0.036458   0.004978   7.324 2.41e-13 ***
bmi          0.078139   0.020605   3.792 0.000149 ***
diabetes     1.150913   0.424242   2.713 0.006670 ** 
age          0.034360   0.017810   1.929 0.053692 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 344.89  on 386  degrees of freedom
AIC: 356.89

Number of Fisher Scoring iterations: 5
cat("\nSelected predictors:", names(coef(selected_model))[-1], "\n")

Selected predictors: pregnant glucose bmi diabetes age 
cat("Number of cases used in selected model:", nobs(selected_model), "\n")
Number of cases used in selected model: 392 
cat("AIC of selected model:", AIC(selected_model), "\n")
AIC of selected model: 356.8851 

The model with the lowest AIC value comes out to be the one with the following variables: pregnant + glucose + bmi + diabetes + age. The selected model used 5 predictor variables out of 8. The number of cases used in the selected model is 392 (utilized the dataset that removed all the rows containing the NA values for each predictor variable.) The AIC value of the selected model is 356.8851. The selected model can be characterized as: \[y = -9.992 + 0.084X_1 + 0.036X_2 + 0.078X_3 + 1.151X_4 + 0.034X_5\]

2.6 Q2.6

Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this. (10pts)

Solution

First, we create a missingness indicator where 1 means that the row has a missing value (NA) and 0 means that the row does not have any missing values. We indicate this missingness indicator as has_missing. Note that we utilize the pima_clean dataset which is the data after all the 0 values are converted into N/A values. We then use the has_missing variable as a predictor of the test result and we notice that the p-value for the variable is 0.304. This signifies that the cases with missing values are not significantly more likely to test positive for diabetes than complete cases. In other words, we don’t have evidence to conclude that the presence of missing data is associated with the test outcome.

pima_clean$has_missing <- ifelse(rowSums(is.na(pima_clean[vars_to_clean])) > 0, 1, 0)

missing_test <- glm(test ~ has_missing, 
                   family = binomial, 
                   data = pima_clean)
summary(missing_test)

Call:
glm(formula = test ~ has_missing, family = binomial, data = pima_clean)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.7008     0.1073  -6.533 6.47e-11 ***
has_missing   0.1558     0.1515   1.028    0.304    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 992.43  on 766  degrees of freedom
AIC: 996.43

Number of Fisher Scoring iterations: 4

Next, we will refit the selected model using as much of the data as reasonable. Since missingness isn’t significantly associated with the outcome, it is reasonable to assume that the data is missing at random. Therefore, instead of removing rows with missing values (which reduces sample size and power), we can impute the missing values with the mean/median or use other imputation methods such as MICE or keep as many observations as possible where variables required for the model are present. This ensures that more data is retained, increasing the statistical power of the final model. In addition, we are able to avoid any biases introduced by only analyzing complete cases, especialy if missingness isn’t strongly related to the “test” response outcome.

The next chunk of code performs multiple imputation to handle missing data and then analyzes the imputed datasts. We first create 5 imputed datasets to account for the missing values using the (MICE or Multivariate Imputation by Chained Equations) algorithm where the algorithm predicts the missing values based on observed data and variable relationships. We will then refit the previously selected model (logistic regression with the predictors pregnant + glucose + bmi + diabetes + age) on each imputed dataset (5 total datasets). The results from all the of the 5 models are combined into a single pooled estimate.

vars_to_impute <- c("test", "pregnant", "glucose", "bmi", "diabetes", "age", 
                    "diastolic", "triceps", "insulin")
imp <- mice(pima_clean[vars_to_impute], m = 5, method = "pmm", printFlag = FALSE)

models <- with(imp, glm(test ~ pregnant + glucose + bmi + diabetes + age, 
                       family = binomial))
pooled_results <- pool(models)
summary(pooled_results)
         term    estimate   std.error  statistic       df      p.value
1 (Intercept) -9.36631167 0.732121627 -12.793382 754.7766 4.654100e-34
2    pregnant  0.12321462 0.032199790   3.826566 759.5399 1.406273e-04
3     glucose  0.03613603 0.003549391  10.180910 757.0814 6.658953e-23
4         bmi  0.08884115 0.014702628   6.042535 757.4560 2.378420e-09
5    diabetes  0.86655746 0.295531687   2.932198 757.9857 3.467093e-03
6         age  0.01037641 0.009185617   1.129637 759.1050 2.589859e-01

After refitting the selected model with the imputed datasets, we get the following model: \[y = -9.3813 + 0.1230X_1 + 0.0362X_2 + 0.0885X_3 + 0.8760X_4 + 0.0108X_5\] ### Q2.7

Using the last fitted model of the previous question [model selected in Question 2.5], compute the odds ratio of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this odds ratio. (10pts)

Solution

We are utilizing the data set after accounting for the missing values (after removing all NA values). The

bmi_coef <- coef(selected_model)["bmi"]
bmi_se <- summary(selected_model)$coefficients["bmi", "Std. Error"]

bmi_quartiles <- quantile(pima_complete$bmi, probs = c(0.25, 0.75), na.rm = TRUE)
q1 <- bmi_quartiles[1] 
q3 <- bmi_quartiles[2] 
bmi_quartiles
 25%  75% 
28.4 37.1 
bmi_diff <- (bmi_coef*(q3 - q1))
or <- exp(bmi_diff) 

ci <- exp(bmi_diff + c(-1.96, 1.96) * bmi_se * (q3 - q1))
sprintf("OR: %.2f (95%% CI: %.2f-%.2f)", or, ci[1], ci[2])
[1] "OR: 1.97 (95% CI: 1.39-2.80)"

The odds ratio of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant is 1.973495. In other words, a woman with a BMI at the third quartile (37.1) has 1.9743 times the odds of testing poistive for diabetes compared to a woman with a BMI at the 1st quartile (28.4), assuming that all other predictors (pregnancies, glucose, and age are held constant). The 95 percent confidence interval for the odds ratio is (1.39-2.80). In other words, we are 95 percent confident that the true odds ratio for diabetes risk (comparing Q3 versus Q1 BMI) lies between 1.36 and 2.64 in the population.

2.7 Q2.8

Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory. (10pts)

Solution In order to check if women who test positive have higher diastolic blood pressures on average, we will compare the mean/median values of the diastolic blood pressures between the two testing groups (those who tested positive versus negative) by using a t-test or the Wilcoxon test. We will first check if the distributions of the Diastolic blood pressures within each group (Negative/Positive Tests) are normally distributed or not.

hist(pima_clean$diastolic[pima_clean$test == "Negative"], main = "Distribution of Diastolic BP (Negative Test)")

hist(pima_clean$diastolic[pima_clean$test == "Positive"], main = "Distribution of Diastolic BP (Positive Test)")

qqnorm(pima_clean$diastolic[pima_clean$test == "Negative"]); qqline(pima_clean$diastolic[pima_clean$test == "Negative"])

qqnorm(pima_clean$diastolic[pima_clean$test == "Positive"]); qqline(pima_clean$diastolic[pima_clean$test == "Positive"])

We are able to see that the distributions are normally distributed and so we are able to use the t-test (parametric test) to compare the means of the diastolic blood pressures between the two groups (Positive and Negative tests).

t.test(diastolic ~ test, data = pima_clean)  

    Welch Two Sample t-test

data:  diastolic by test
t = -4.6643, df = 504.72, p-value = 3.972e-06
alternative hypothesis: true difference in means between group Negative and group Positive is not equal to 0
95 percent confidence interval:
 -6.316023 -2.572156
sample estimates:
mean in group Negative mean in group Positive 
              70.87734               75.32143 

We are able to see from the results of the two sample t-test that the difference in the means of the Diastolic Blood Pressure between the two groups (Positive and Negative) is statistically significant because the p-value is smaller than 0.05. The means in each group are as follows: Negative group (the diastolic blood pressure mean is 70.87) and the Positive group (the diastolic blood pressure mean is 75.32143.) Therefore, we are able to conclude that women who test positive do have higher diastolic blood pressures.

Next, we check whether the predictor variable diastolic blood pressure is significant in the regression model. The regression model that is utilized in this question is the best model that was found in 2.5 with the diastolic blood pressure predictor variable added. The p-value corresponding to diastolic blood pressure is 0.945949 which is not significant in the full model. This means that DBP does not contribute significantly to the regression model when other variables are included.

model_with_dbp <- update(selected_model, . ~ . + diastolic)
summary(model_with_dbp)

Call:
glm(formula = test ~ pregnant + glucose + bmi + diabetes + age + 
    diastolic, family = binomial, data = pima_complete)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.9604796  1.1818764  -8.428  < 2e-16 ***
pregnant     0.0840497  0.0550728   1.526 0.126971    
glucose      0.0364863  0.0049973   7.301 2.85e-13 ***
bmi          0.0785728  0.0215674   3.643 0.000269 ***
diabetes     1.1492368  0.4250340   2.704 0.006854 ** 
age          0.0346079  0.0181919   1.902 0.057121 .  
diastolic   -0.0008002  0.0118034  -0.068 0.945949    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 344.88  on 385  degrees of freedom
AIC: 358.88

Number of Fisher Scoring iterations: 5

The first question examines a univariate association (diastolic blood pressure versus the test outcome while not taking account for the other predictors in the model) while the second question examines a multivariate association (the diastolic blood pressure’s contribution after accounting for other predictors). Although the t-test shows that the diastolic blood pressure differs between the groups in isolation, the regression model shows that the diastolic blood pressure predictor variables does not provide additional predictive power beyond the other variables in the model. Therefore, the answers seem contradictory because in the t-test we were able to see that the diastolic blodo pressure alone is associated with the test outcome. In the regression, diastolic blood pressure may be correlated with other predictors, therefore making it insiginificant. In addition, even though the mean difference between the diastolic blood pressures between the two testing groups may be statistically significant, it doesn’t improve the prediction meaningfully in a multivariate model.

END